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Abstract 

Bitmap indexes must be compressed to reduce input/output costs and minimize 
CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, 
we use techniques based on run-length encoding (RLE), such as Word-Aligned 
Hybrid (WAH) compression. These techniques are sensitive to the order of the 
rows: a simple lexicographical sort can divide the index size by 9 and make 
indexes several times faster. We investigate row-reordering heuristics. Simply 
permuting the columns of the table can increase the sorting efficiency by 40%. 
Secondary contributions include efficient algorithms to construct and aggregate 
bitmaps. The effect of word length is also reviewed by constructing 16-bit, 
32-bit and 64-bit indexes. Using 64-bit CPUs, we find that 64-bit indexes are 
slightly faster than 32-bit indexes despite being nearly twice as large. 
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1. Introduction 

Bitmap indexes are among the most commonly used indexes in data ware- 
houses (HE]. Without compression, bitmap indexes can be impractically large 
and slow. Word- Aligned Hybrid (WAH) [3] is a competitive compression tech- 
nique: compared to LZ77 jl] and Byte- Aligned Bitmap Compression (BBC) [5], 
WAH indexes can be ten times faster [BJ. 

Run-length encoding (RLE) and similar encoding schemes (BBC and WAH) 
make it possible to compute logical operations between bitmaps in time propor- 
tional to the compressed size of the bitmaps. However, their efficiency depends 
on the order of the rows. While we show that computing the best order is 
NP-hard, simple heuristics such as lexicographical sort are effective. 

Table [I] compares the current paper to related work. Pinar et al. [5], Sharma 
and Goyal [7 , and Canahuate et al. [10] used row sorting to improve RLE and 
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WAH compression. However, their largest bitmap index could fit uncompressed 
in RAM on a PC. Our data sets are 1 million times larger. 

Our main contribution is an evaluation of heuristics for the row ordering 
problem over large data sets. Except for the naive 2-switch heuristic, we re- 
view all previously known heuristics, and we consider several novel heuristics 
including lexicographical ordering, Gray-Frequency, partial sorting, and column 
reorderings. Because we consider large data sets, we can meaningfully address 
the index construction time. Secondary contributions include 

• guidelines about when "unary" bitmap encoding is preferred (§[8|; 

• an improvement over the nai've bitmap construction algorithm — it is now 
practical to construct bitmap indexes over tables with hundreds of millions 
of rows and millions of attribute values (see Algorithm [lj ; 

• an algorithm to compute important Boolean operations over many bitmaps 
in time 0((Yli=i 1^1) l°g-^) where Ylf=i \Bi\is the total size of the bitmaps 
(see Algorithm [3]) ; 

• the observation that 64-bit indexes can be slightly faster than 32-bit in- 



dexes on a 64-bit CPU, despite file sizes nearly twice as large (see § 7.121 



The last two contributions are extensions of the conference version of this pa- 
per [IT]. 

The remainder of this paper is organized as follows. We define bitmap in- 
dexes in § [2j where we also explain how to map attribute values to bitmaps using 
encodings such as k-oi-N. We present compression techniques in §[3j In §|4] we 
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consider the complexity of the row-reordering problem. Its NP-hardness moti- 
vates use of fast heuristics, and in §[5j we review sorting-based heuristics. In §[6] 
we analyze k-oi-N encodings further to determine the best possible encoding. 
Finally, §[7] reports on several experiments. 

2. Bitmap indexes 

We find bitmap indexes in several database systems, apparently beginning 
with the MODEL 204 engine, commercialized for the IBM 370 in 1972 [T2]. 
Whereas it is commonly reported [13] that bitmap indexes are suited to small 
dimensions such as gender or marital status, they also work over large dimen- 
sions [3J [2]. And as the number of dimensions increases, bitmap indexes be- 
come competitive against specialized multidimensional index structures such as 
R-trees [15]. 

The simplest and most common method of bitmap indexing associates a 
bitmap with every attribute value v of every attribute a; the bitmap represents 
the predicate a = v. Hence, the list cat, dog, cat, cat, bird, bird becomes the 
three bitmaps 1,0,1,1,0,0, 0,1,0,0,0,0, and 0,0,0,0,1,1. For a table with n rows 
(facts) and c columns (attributes/dimensions), each bitmap has length n; ini- 
tially, all bitmap values are set to 0. Then, for row j, we set the j th compo- 
nent of c bitmaps to 1. If the i th attribute has rii possible values, we have 
L = J2i=i n i bitmaps. 

We expect the number of bitmaps in an index to be smaller than the number 
of rows. They are equal if we index a row identifier using a unary bitmap in- 
dex. However, we typically find frequent attribute values |16j . For instance, 
in a Zipfian collection of n items with N distinct values, the item of rank 
k 6 {1, . . . , N} occurs with frequency . - . The least frequent item has 

frequency , - and we have that Y^iLi Vi* — 1- Setting . - > 1 

and assuming N large, we have N s < n, so that N < i/n. Hence, for highly 
skewed distributions (s > 2), the number of distinct attribute values N is much 
smaller than the number of rows n. 

Bitmap indexes are fast, because we find rows having a given value v for 
attribute a by reading only the bitmap corresponding to value v (omitting the 
other bitmaps for attribute a), and there is only one bit (or less, with compres- 
sion) to process for each row. More complex queries are achieved with logical 
operations (AND, OR, XOR, NOT) over bitmaps and current microprocessors 
can do 32 or 64 bitwise operations in a single machine instruction. 

Bitmap indexes can be highly compressible: for row j, exactly one bitmap 
per column will have its j th entry set to 1. Although the entire index has nL bits, 
there are only nc l's; for many tables, L>c and thus the average table is very 
sparse. Long (hence compressible) runs of 0's are expected. 

Another approach to achieving small indexes is to reduce the number of 
bitmaps for large dimensions. Given L bitmaps, there are L(L — l)/2 pairs of 
bitmaps. So, instead of mapping an attribute value to a single bitmap, we map 
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Table 2: Example of 1-of-N and 2-of-N encoding 

Montreal 100000000000000 110000 

Paris 010000000000000 101000 

Toronto 001000000000000 100100 

New York 000100000000000 100010 

Berlin 000010000000000 100001 

them to pairs of bitmaps (see Table [2]) . We refer to this technique as 2-of-iV 
encoding [17]; with it, we can use far fewer bitmaps for large dimensions. For 
instance, with only 2 000 bitmaps, we can represent an attribute with 2 million 
distinct values. Yet the average bitmap density is much higher with 2-of-iV 
encoding, and thus compression may be less effective. More generally, k-of-N 
encoding allows L bitmaps to represent (^) distinct values; conversely, using 

L = [fcnj^ fc ] bitmaps is sufficient to represent i%i distinct values. However, 
searching for a specified value v no longer involves scanning a single bitmap. 
Instead, the corresponding k bitmaps must be combined with a bitwise AND. 
There is a tradeoff between index size and the index speed. 

For small dimensions, using k-of-N encoding may fail to reduce the number 
of bitmaps, but still reduce the performance. For example, we have that N > 
(T) > (3) > (1) for ^ < 4, so that 1-of-JV is preferable when N < 4. We 
choose to limit 3-of-iV encoding for when N > 6 and 4-of-JV for when N > 8. 
Hence, we apply the following heuristic. Any column with less than 5 distinct 
values is limited to l-oi-N encoding (simple or unary bitmap). Any column 
with less than 21 distinct values, is limited to k G {1, 2}, and any column with 
less than 85 distinct values is limited to k € {1, 2, 3}. 

Multi-component encoding |4 works similarly to k-of-N encoding in reduc- 
ing the number of bitmaps: we factor the number of attribute values n — or a 
number slightly exceeding it — as n = n\n 2 . . . n Kl with rii > 1 for all i. Any 
number i £ {0,1, ... ,n — 1} can be written uniquely in a mixed-radix form 

as i = n + qir 2 + q\qsXi + h r k qiq 2 . . . q K -i where ^ € {0,1,...,^- 1}. 

We use a particular encoding scheme (typically 1-of-JV) for each of the n val- 
ues 7T, r 2 , . . . , r K representing the i th value. Hence, using Y2i=i Qi bitmaps we 
can code n different values. Compared to k-of-N encoding, multi-component 
encoding may generate more bitmaps. 

Lemma 1. Given the same number of attribute values n, k-of-N encoding never 
uses more bitmaps than multi- component indexing. 

Proof. Consider a qi, q 2 , . . . , g K -component index. It supports up to n = 
nr=i Qi distinct attribute values using Qi bitmaps. For n — Yii=i Qi fixed, 

we have that X)i=i Qi ^ s minimized when qi = tfn for all i, hence y^L-, qj > 
[«;-y/n]. Meanwhile, (^) > (N/k) k ; hence, by picking N = \n-tfn\, we have 
CT) ^ n - Thus, with at most X^=i Qi bitmaps we can represent at least n 
distinct values using k-of-N encoding (k — n, N — \n^/n\), which shows the 
result. 
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To further reduce the size of bitmap indexes, we can bin the attribute val- 
ues [T51 - I2T] . For range queries, Sinha and Winslett use hierarchical binning [22] . 

3. Compression 

RLE compresses long runs of identical values: it replaces any repetition by 
the number of repetitions followed by the value being repeated. For example, 
the sequence 11110000 becomes 4140. The counter values (e.g., 4) can be stored 
using variable- length counters such as gamma [23] or delta codes. With these 
codes, any number x can be written using O(logcc) bits. Alternatively, we 
can used fixed- length counters such as 32-bit integers. It is common to omit 
the counter for single values, and repeat the value twice whenever a counter is 
upcoming: e.g., 1011110000 becomes 10114004. 

Current microprocessors perform operations over words of 32 or 64 bits and 
not individual bits. Hence, the CPU cost of RLE might be large [33]. By trad- 
ing some compression for more speed, Antoshenkov [5] defined a RLE variant 
working over bytes instead of bits (BBC). Trading even more compression for 
even more speed, Wu et al. [3J proposed WAH. Their scheme is made of two 
different types of wordfj^] The first bit of every word is true (1) for a running 
sequence of 31-bit clean words (0x00 or lxll), and false (0) for a verbatim (or 
dirty) 31-bit word. Running sequences are stored using 1 bit to distinguish be- 
tween the type of word (0 for 0x00 and 1 for lxll) and 30 bits to represent the 
number of consecutive clean words. Hence, a bitmap of length 62 containing 
a single 1-bit at position 32 would be coded as the words 100x01 and 010x00. 
Because dirty words are stored in units of 31 bits using 32 bits, WAH compres- 
sion can expand the data by 3%. We studied a WAH variant that we called 
Enhanced Word-Aligned Hybrid (EWAH): in a technical report, Wu et al. [25] 
called the same scheme Word- Aligned Bitmap Code (WBC). Contrary to WAH 
compression, EWAH may never (within 0.1%) generate a compressed bitmap 
larger than the uncompressed bitmap. It also uses only two types of words (see 
Fig. [I]), where the first type is a 32-bit verbatim word. The second type of word 
is a marker word: the first bit indicates which clean word will follow, half the 
bits (16 bits) are used to store the number of clean words, and the rest of the 
bits (15 bits) are used to store the number of dirty words following the clean 
words. EWAH bitmaps begin with a marker word. 

3.1. Comparing WAH and EWAH 

Because EWAH uses only 16 bits to store the number of clean words, it 
may be less efficient than WAH when there are many consecutive sequences of 
2 16 identical clean words. The seriousness of this problem is limited because 
tables are indexed in blocks of rows which fit in RAM: the length of runs does 
not grow without bounds even if the table does. In § |7.3[ we show that this 



For simplicity, we limit our exposition to 32 bit words. 
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a) an example bitmap being compressed 
1 000000001 1 1 000001 1 00001 1 1 00001 1 
0000000000000....00000000000000000 

ooi 1 1 m 1 1 1 1 oooooooooooooi noooi 

b) dividing the bitmap into 32-bit groups 
1 000000001 1 1 000001 1 00001 1 1 00001 1 
0000000000000....00000000000000000 
001 1 1 1 1 1 1 1 1 1 OOOOOOOOOOOOOI 1 1 0001 

c) EWAH encoding 
00000000000000000000000000000001 
1 000000001 1 1 000001 1 00001 1 1 00001 1 
00001 01 01 0001 0000000000000000001 
001 1 1 11 1 1 1 11 OOOOOOOOOOOOOI 1 1 0001 



I— number of clean words: 1 6 bits 
1 — type of the clean words: 1 bit 

number of dirty words following clean words: 1 5 bits 



(5456 bits) 
32 bits 
5392 bits 
32 bits 

group 1:32 bits 

group 2-1 75: 174*32 bits 

group 176:32 bits 

marker-word 
dirty word 
marker-word 
dirty word 



Figure 1: Enhanced Word-Aligned Hybrid (EWAH) 



overhead on compressing clean words is at most 14% on our sorted data sets — 
and this percentage is much lower (3%) when considering only unsorted tables. 
Furthermore, about half of the compressed bitmaps are made of dirty words, on 
which EWAH is 3% more efficient than WAH. 

We can alleviate this compression overhead over clean words in several ways. 
On the one hand, we can allocate more than half of the bits to encode the runs 
of clean words |25j . On the other hand, when a marker word indicates a run 
of 2 16 clean words, we could use the convention that the next word indicates 
the number of remaining clean words. Finally, this compression penalty is less 
relevant when using 64-bit words instead of 32-bit words. 

When there are long runs of dirty words in some of the bitmaps, EWAH 
might be preferable — it will access each dirty word at most once, whereas a WAH 
decoder checks the first bit of each dirty word to ascertain it is a dirty word. 
An EWAH decoder can skip a sequence of dirty words whereas a WAH decoder 
must access them all. For example, if we compute a logical AND between a 
bitmap containing only dirty words, and another containing very few non-zero 
words, the running time of the operation with EWAH compression will only 
depend on the small compressed size of the second bitmap. 

When there are few dirty words in all bitmaps, WAH might be preferable. 
Even considering EWAH and WAH indexes of similar sizes, each EWAH marker 
word needs to be accessed three times to determine the running bits and two 
running lengths, whereas no word needs to be accessed more than twice with 
WAH. 

3.2. Constructing a bitmap index 

Given L bitmaps and a table having n rows and c columns, we can naively 
construct a bitmap index in time 0(nL) by appending a word to each com- 
pressed bitmap every 32 or 64 rows. We found this approach impractically slow 
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when L was large — typically, with k — 1. Instead, we construct bitmap indexes 
in time proportional to the size of the index (see Algorithm [T]) : within each 
block of w rows (e.g., w — 32), we store the values of the bitmaps in a set — 
omitting any unsolicited bitmap, whose values are all false (0x00). We use the 
fact we can add several clean words of the same type to a compressed bitmap 
in constant time. 

Our implementation is able to generate the index efficiently on disk, even 
with extremely large tables and millions of (possibly small) compressed bitmaps, 
using horizontal partitioning: we divide the table's rows into large blocks, such 
that each block's compressed index fits in a fixed memory budget (256 MiB). 
Each block of bitmaps is written sequentially [26 and preceded by an array of 
4-byte integers containing the location of each bitmap within the block. 



Algorithm 1 Constructing bitmaps. For simplicity, we assume the number of 
rows is a multiple of the word size. 

Construct: B\, . . . , -Bl, L compressed bitmaps 

length(Bi) is current (uncompressed) length (in bits) of bitmap Bi 

w is word length in bits, a power of 2 (e.g., w = 32) 

u>i <— for 1 < i < L. 

c <— 1 {row counter} 

M <— {N records the dirtied bitmaps} 

for each table row do 

for each attribute in the row do 

for each bitmap i corresponding to the attribute value do 
set to true the (c mod w) th bit of word Wj 
TV ^ TV u {i} 
if c is a multiple of w then 
for i in N do 

add c/w — length(_Bi) — 1 clean words (0x00) to Bi 
add the word u>i to bitmap Bi 

LOi <— 

c^- c+ 1 
for i in {1,2,. . . ,L} do 

add c/w — \Bi \ — 1 clean words (0x00) to Bi 



3.3. Faster operations over compressed bitmaps 

Beside compression, there is another reason to use RLE: it makes operations 
faster [5] . Given (potentially many) compressed bitmaps Bi , . . . , of sizes 
\Bi\, Algorithm [2] computes Af =1 5; and Vf =1 Bi in timfQ 0(LY,i \Bi\). For 
BBC, WAH, EWAH and all similar RLE variants, similar algorithms exists: we 
only present the results for traditional RLE to simplify the exposition. 



2 Unlcss otherwise stated, we use RLE compression with ui-bit counters. In the complexity 
analysis, we do not bound the number of rows n. 
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Indeed, within a given pass through the main loop of Algorithm[2] we need to 
compute the minimum and the maximum between L w-bit counter values which 
requires O(L) time. Hence, the running time is determined by the number of 
iterations, which is bounded by the sum of the compressed sizes of the bitmaps 

For RLE with variable-length counters, the runs are encoded using log n bits 
and so each pass through the main loop of Algorithm [2] will be in 0(L log n), 
and a weaker result is true: the computation is in time 0(LJ2 { \Bi\logn). We 
should avoid concluding that the complexity is worse due to the logrt factor: 
variable-length RLE can generate smaller bitmaps than fixed-length RLE. 



Algorithm 2 Generic 0(LJ2i 1-6*1) algorithm to compute any bitwise opera- 
tions between L bitmaps. We assume the L-ary bitwise operation, 7, itself is in 

0{L). 

INPUT: L bitmaps Bi,...B L 
Ii <— iterator over the runs of identical bits of Bi 
r <— representing the aggregate of B\, . . . Bl (initially empty) 
while some iterator has not reached the end do 

let a 1 be the maximum of all starting values for the runs of Ii , 
let a be the minimum of all ending values for the runs of I\ , . . 
append run [a' , a] to T with value determined by 7(Ji, . . . , II) 
increment all iterators whose current run ends at a. 



A stronger result is possible if the bitwise operation is updatable in 0(log L) 
time. That is, given the result of an updatable L-ary operation 7(61, 62, • • • , 6l), 
we can compute the updated value when a single bit is modified (6^), 

7(61,62, ■ • • M-i^'iM+i, ■ ■ • 

in 0(log L) time. All symmetric Boolean functions are so updatable: we merely 
maintain a count of the number of ones, which (for a symmetric function) de- 
termines its value. Symmetric functions include AND, OR, NAND, NOR, XOR 
and so forth. For example, given the number of 1-bits in a set of L bits, we 
can update their logical AND or logical OR aggregation ( Af =1 bi, \/f =1 bi) in 
constant time given that one of the bits changes its value. Fast updates also 
exist for functions that are symmetric except that specified inputs are inverted 
(e.g., Horn clauses). 

From Algorithm [3j we have the following lemma. (The result is presented for 
fixed-length counters; when using variable-length counters, multiply the com- 
plexity by logn.) 

Lemma 2. Given L RLE-compressed bitmaps of sizes \B\\ 1 1-B2I, ■ ■ ■ , \Bl\ o,nd 
any bitwise logical operation computable in 0(L) time, the aggregation of the 
bitmaps is in time OQ3i=i |-Si|L). // the bitwise operation is updatable in 
O(logL) time, the aggregation is in time 0(^2f =1 \Bi\logL). 
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Algorithm 3 Generic OQ2i \Bi\logL) algorithm to compute any bitwise op- 
erations between L bitmaps updatable in O(logL) time. 

INPUT: L bitmaps B 1 ,...B L 

Ii <— iterator over the runs of identical bits of Bi 

r <— representing the aggregate of B\, . . . Bl (initially empty) 

7 be the bit value determined by j(Ii, . . . , II) 

H' is an L-element max-heap storing starting values of the runs (one per bitmap) 
H is an L-element min-heap storing ending values of the runs and an indicator of 
which bitmap 

a table T mapping each bitmap to its entry in H' 
while some iterator has not reached the end do 

let a be the maximum of all starting values for the runs of Ji , . . . , II, determined 

from H' 

let a be the minimum of all ending values for the runs of Ii, . . . , II, determined 
from H 

append run [a' , a] to V with value 7 

for iterator Ii with a run ending at a (selected from H) do 
increment U while updating 7 in O(logL) time 
pop a value from H, insert new ending run value to H 

from hash table, find old starting value in H' , and increase it to the new starting 
value 



B, 

B 2 
B, 



B, 





,(a, B 2 ) 



y= xor(l,l,l,...,0)=l 



Figure 2: Algorithm [3] in action. 



Corollary 1. This result is also true for word-aligned (BBC, WAH or EWAH) 
compression. 

See Fig. [2] where we show the XOR of L bitmaps. This situation depicted 
has just had J 2 incremented, and 7 is about to be updated to reflect the change 
of L?2 from ones to zeros. The value of a will then be popped from H, whose 
minimum value will then be the end of the I\ run. Table T will then allow us 
to find and increase the key of LVs entry in H' , where it will become a + 1 and 
likely be promoted toward the top of H' . 
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In the rest of this section, we assume an RLE encoding such that the merger 
of two running lengths reduces the total size (0 repeated x times and repeated 
y times, becomes repeated x + y times). These encodings include BBC, WAH 
and EWAH. We also consider only fixed-length counters; for variable-length 
counters, the running time complexity should have the bitmap index size mul- 
tiplied by log n. 

From Algorithm [3] we have that | A ieS B t \ < \J2ies B *\> I v *es B ^\ < 
I Sies and so on for other binary bitwise operation such as ©. This bound 
is practically optimal: e.g., the logical AND of the bitmaps 10. . . 10 (n runs) 
and 11. . . 11 (1 run) is 10. . . 10 (n runs). 

Hence, for example, when computing B\ A B 2 A B3 A + • • • A Bl we may 
start with the computation of 61 A B 2 = -B1.2 in 0(|B1| + \B2\) time. The 
bitmap B12 is of size at most \Bi\ + I-B2I, hence B\2 A B3 can be done in time 
O ( I B x I + 1 B 2 1 + 1 B 3 1 ) . Hence , the t ot al running time is in O (£f =1 (L — i + 1 ) 1 5; | ) . 

Hence, there are at least three different generic algorithms to aggregate a 
set of L bitmaps for these most common bitwise operations: 

• We use Algorithm^ which runs in time 0((Y2f=i \ B i\) logL). It generates 
a single output bitmap, but it uses two L-elemcnt heaps. It works for a 
wide range of queries, not only simple queries such as \/f =1 Bi. 

• We aggregate two bitmaps at a time starting with B\ and B2, then ag- 
gregating the result with B3, and so on. This requires time 0(%2f =1 (L — 
i + l)|Bj|). While only a single temporary compressed bitmap is held in 
memory, L — 1 temporary bitmaps are created. To minimize processing 
time, the input bitmaps can be sorted in increasing size. 

• We can store the bitmaps in a priority queue |27j . We repeatedly pop 
the two smallest bitmaps, and insert the aggregate of the two bitmaps. 
This approach runs in time 0((%2f=i log i), and it generates L — 
1 intermediate bitmaps. 

• Another approach is to use in-place computation |27j : (1) an uncom- 
pressed bitmap is created in time 0(n) (2) we aggregate the uncompressed 
bitmap with the each one of the compressed bitmaps (3) the row IDs are 
extracted from the uncompressed bitmap in time 0(n). For logical OR 
(resp. AND) aggregates, the uncompressed bitmap is initialized with ze- 
roes (resp. ones). The total cost is in O(Ln): L passes over the uncom- 
pressed bitmap will be required. However, when processing each com- 
pressed bitmap, we can skip over portions of the uncompressed bitmaps 
e.g., when we compute a logical OR, we can omit runs of zeroes. If the 
table has been horizontally partitioned, it will be possible to place the 
uncompressed bitmap in main memory. 

We can minimize the complexity by choosing the algorithm after loading the 
bitmaps. For example, to compute logical OR over many bitmaps with long 
runs of zeroes — or logical AND over many bitmaps with long runs of ones — 
an in-place computation might be preferable. When there are few bitmaps, 



10 



computing the operation two bitmaps at a time is probably efficient. Otherwise, 
using Algorithm [3] or a priority queue [37] might be advantageous. Unlike the 
alternatives, Algorithm [3] is not limited to simple queries such as vf =1 Bi. 

4. Finding the best reordering is NP-Hard 

Let d(r, s) be the number of bits differing between rows r and s. Our problem 
is to find the best ordering of the rows so as to minimize d(ri, rj+i). 
Pinar et al. have reduced the row-reordering problem to the Traveling Salesman 
Problem (TSP) [9l Theorem 1] using d as the distance measure. Because d 
satisfies the triangle inequality, the row-reordering problem can be approximated 
with 1.5-optimal cubic-time algorithms [25]. Pinar and Heath [29] proved that 
the row-reordering problem is NP-Hard by reduction from the Hamiltonian path 
problem. 

However, the hardness of the problem depends on L being variable. If the 
number L of bitmaps were a constant, the next lemma shows that the problem 
would not be NP-harcj^J an (impractical) linear-time solution is possible. 

Lemma 3. For any constant number of bitmaps L, the row-reordering problem 
requires only 0(n) time. 

Proof. Suppose that an optimal row ordering is such that identical rows do 
not appear consecutively. Pick any row value any sequence of L bits appearing 
in the bitmap index — and call it a. Consider two occurrences of a, where one 
occurrence of the row value a appears between the row values b and c: we may 
have b = a and/or c = a. Because the Hamming distance satisfies the triangle 
inequality, we have d(b, c) > d(b, a) + d(a, c). Hence, we can move the occurrence 
of a from between b and c, placing it instead with any other occurrence of a — 
without increasing total cost, d(rj,rj+i). Therefore, there is an optimal 
solution with all identical rows clustered. 

In a bitmap index with L bitmaps, there are only 2 L different possible dis- 
tinct rows, irrespective of the total number of rows n. Hence, there are at 
most (2 L )l solutions to enumerate where all identical rows are clustered, which 
concludes the proof. 

If we generalize the row-reordering problem to the word-aligned case, the 
problem is still NP-hard. We can formalize the problem as such: order the 
rows in a bitmap index such that the storage cost of any sequence of identical 
clean words (0x00 or lxll) costs w bits whereas the the cost of any other word 
is w bits. 

Theorem 1. The word-aligned row-reordering problem is NP-hard if the num- 
ber of bits per word (w) is a constant. 



3 Assuming P ^ NP. 
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Proof. Consider the case where each row of the bitmap is repeated w times. 
It is possible to reorder these identical rows so that they form only clean words 
(lxll and 0x00). There exists an optimal solution to the word-aligned row- 
reordering problem obtained by reordering these blocks of w identical rows. 
The problem of reordering these clean words is equivalent to the row-ordering 
problem, which is known to be NP-hard. 



5. Sorting to improve compression 

Sorting can benefit bitmap indexes at several levels. We can sort the rows 
of the table. The sorting order depends itself on the order of the table columns. 
And finally, we can allocate the bitmaps to the attribute values in sorted order. 

5.1. Sorting rows 

Reordering the rows of a compressed bitmap index can improve compression. 
Whether using RLE, BBC, WAH or EWAH, the problem is NP-hard (see §[§. 
A simple heuristic begins with an uncompressed index. Rows (binary vectors) 
are then rearranged to promote runs. In the process, we may also reorder 
the bitmaps. This is the approach of Pinar et al. [9], Sharma and Goyal [7], 
Canahuate et al. [TO], and Apaydin et al. [H], but it uses Q.{nL) time. For the 
large dimensions and number of rows we have considered, it is infeasible. A more 
practical approach is to reorder the table, then construct the compressed index 



5.31 



directly (see § 5.2.2 1; we can also reorder the table columns prior to sorting (see 



Sorting lexicographically large files in external memory is not excessively 
expensive [30] 131] • With a memory buffer of M elements, we can sort almost 
M 2 elements in two passes. 

Several types of ordering can be used for ordering rows. 

• In lexicographic order, a sequence 121,02, . . . is smaller than another se- 
quence 61, 62, ■ • ■ if and only if there is a j such that a,j < bj and a, = bi 
for i < j. The Unix sort command provides an efficient means of sorting 
flat files into lexicographic order; in under 10 s our test computer (see §|7| 
sorted a 5-million-line, 120 MB file. SQL supports lexicographic sort via 
ORDER BY. 

• We may cluster runs of identical rows. This problem can be solved with 
hashing algorithms, by multiset discrimination algorithms [32], or by a lex- 
icographic sort. While sorting requires £l(n log n) time, clustering identical 
facts requires only linear time (O(n)). However, the relative efficiency of 
clustering decreases drastically with the number of dimensions. The rea- 
son is best illustrated with an example. Consider lexicographically-sorted 
tuples (a, a), (a, b), (b, c), (b, d). Even though all these tuples are distinct, 
the lexicographical order is beneficial to the first dimension. Random 
multidimensional row clustering fails to cluster the values within columns. 
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• Instead of fully ordering all of the rows, we may reorder rows only within 



disjoint blocks (see § 7.4 1. Block- wise sorting is not competitive. 



• Gray-code (GC) sorting, examined next. 

GC sorting is defined over bit vectors [9]. The list of 2-of-4 codes in increasing 
order is 0011, 0110, 0101, 1100, 1010, 1001. Intuitively, the further right the 
first bit is, the smaller the code is, just as in the lexicographic order. However, 
contrary to the lexicographic order, the further left the second bit is, the smaller 
the code is. Similarly, for a smaller code, the third bit should be further right, 
the fourth bit should be further left and so on. Formally, we define the Gray- 
code order as follows. 

Definition 1. The sequence ai, <22, • • • is smaller than b\, 62, . . . if and only if 
there exists j such tha^ aj = ai © 02 © . . . © dj-i, bj 7^ aj, and ai = hi for 
i < j. 

We denote this ordering by < gc , as opposed to the normal lexicographic 
ordering, <i cx . The reflexive versions of these are < gc and <i cx , respectively. 

Algorithm [4] an adaptation of Ernvall's procedure [33j [34] to sparse data, 
shows how to compare sparse GC bit vectors V\ and Vi in time 0(min(|wi|, |t> 2 |) 
where is the number of true value in bit vector v^. Sorting the rows of a 
bitmap index without materializing the uncompressed bitmap index is possi- 
ble: we implemented an 0(nck log n)-time solution for k-of-N indexes using an 
external-memory B-tree [35 (c is the number of columns). As values, we used 
the rows of the table, and as keys, we used the position of the ones in the bitmap 
row as 32-bit integers — some of our indexes have half a million bitmaps. Hence, 
we used 4cfc bytes per row for storing the keys alone. Both keys and values were 
compressed using LZ77 to minimize I/O costs — compression improved perfor- 
mance noticeably in our informal tests. We expected that this implementation 
would be significantly slower than lexicographic sorting, but the degree of dif- 
ference surprised us: our implementation proved to be two orders of magnitude 
slower than lexicographic sort using the Unix sort command. 



For some of our tests (see § 7.9 1, we wish to rearrange the order of the 
bitmaps prior to GC sorting. We get this result by applying the appropriate 
permutation to the positions that form the B-tree keys, during the B-tree's 
construction. 

For RLE, the best ordering of the rows of a bitmap index minimizes the sum 
of the Hamming distances: ^ i h(ri, r^i) where is the i th row, for h(x,y) = 

{i\xi ^ yi} . If all 2 L different rows are present, the GC sort would be an 

optimal solution to this problem [9j. The following proposition shows that GC 
sort is also optimal if all (^) k-of-N codes are present. The same is false of 
lexicographic order when k > 1: 0110 immediately follows 1001 among 2-of-4 
codes, but their Hamming distance is 4. 



4 The symbol © is the XOR operator. 
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Algorithm 4 Gray-code less comparator between sparse bit vectors 

INPUT: arrays a and b representing the position of the ones in two bit 
vectors, a' and b' 
OUTPUT: whether a' < gc b' 
f <— true 

m <— min(length(a), length(fo)) 
for p in 1, 2, ... } m do 

return f if a p > b p and ->/ if a p < & p 

./' • ./' 

return ->/ if length(a) > length(6), / if length(fe) > length(a), and false 
otherwise 



Proposition 1. We can enumerate, in GC order, all k-of-N codes in time 
0(k( k )) (optimal complexity). Moreover, the Hamming distance between suc- 
cessive codes is minimal (—2). 

Proof. Let a be an array of size k indicating the positions of the ones in k-of-N 
codes. As the external loop, vary the value a\ from 1 to N — k + 1. Within this 
loop, vary the value a 2 from N — k + 2 down to a\ + 1. Inside this second loop, 
vary the value of 03 from 02 + 1 up to N — k + 3, and so on. By inspection, we 
see that all possible codes are generated in decreasing GC order. To see that 
the Hamming distance between successive codes is 2, consider what happens 
when ai completes a loop. Suppose that i is odd and greater than 1, then 
had value N — k + i and it will take value a^-i + 1. Meanwhile, by construction, 
a.; + i (if it exists) remains at value N — k + i + 1 whereas remains at value 
N — k + i + 2 and so on. The argument is similar if i is even. 

For encodings like BBC, WAH or EWAH, GC sorting is suboptimal, even 
when all k-of-N codes are present. For example consider the sequence of rows 
1001, 1010, 1100, 0101, 0101, 0110, 0110, 0011. Using 4-bit words, we see that 
a single bitmap contains a clean word (0000) whereas by exchanging the fifth 
and second row, we get two clean words (0000 and 1111). 

5.2. Sorting bitmap codes 

For a simple index, the map from attribute values to bitmaps is inconse- 
quential; for k-of-N encodings, some bitmap allocations are more compressible: 
consider an attribute with two overwhelmingly frequent values and many other 
values that occur once each. If the table rows are given in random order, the 
two frequent values should have codes that differ in Hamming distance as little 
as possible to maximize compression (see Fig. [3] for an example). However, it 
is also important to allocate bitmaps well when the table is sorted, rather than 
randomly ordered. 

There are several ways to allocate the bitmaps. Firstly, the attribute val- 
ues can be visited in alphabetical or numerical order, or — for histogram-aware 
schemes — in order of frequency. Secondly, the bitmap codes can be used in 
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Figure 3: Two bitmaps representing the sequence of values a,b,a,b,b,a using different codes. 
If codes have a Hamming distance of two (right), the result is more compressible than if the 
Hamming distance is four (left). 



different orders. We consider lexicographical ordering (1100, 1010, 1001, 0110, 
. . . ) and GC order (1001, 1010, 1100, 0101, . . . ) ordering (see proof of Propo- 
sition [TJ . 

Binary-Lex denotes sorting the table lexicographically and allocating bitmap 
codes so that the i th attribute gets the i th numerically smallest bitmap code, 
when codes are viewed as binary numbers. Gray-Lex is similar, except that the 
i th attribute gets the rank-i bitmap code in GC order. (Binary-Lex and Gray- 
Lex coincide when k — 1.) These two approaches are histogram oblivious — they 
ignore the frequencies of attribute values. 

Knowing the frequency of each attribute value can improve code assignment 
when k > 1. Within a column, Binary-Lex and Gray-Lex order runs of iden- 
tical values irrespective of the frequency: the sequence af cccadeaceabe may 
become aaaabccccdeeef . For better compression, we should order the attribute 
values —within a column — by their frequency (e.g., aaaacccceeebdf ). Allocat- 
ing the bitmap codes in GC order to the frequency-sorted attribute values, our 
Gray- Frequency sorts the table rows as follows. Let /(a^) be the frequency of 
attribute a^. Instead of sorting the table rows a\, 02, . . . , a^, we lexicograph- 
ically sort the extended rows /(ai), ai, /(a^), 02, ... , f{ad), &d (comparing the 
frequencies numerically.) The frequencies /(a,) are discarded prior to indexing. 

5.2.1. No optimal ordering when k > 1 

No allocation scheme is optimal for all tables, even if we consider only lexi- 
cographically sorted tables. 

Proposition 2. For any allocation C of attribute values to k-of-N codes, there 
is a table where C leads to a suboptimal index. 

Proof. Consider a lexicographically sorted table, where we encode the second 
column with C. We construct a table where C is worse than some other ordering 
C'. The first column of the table is for attribute A\, which is the primary sort 
key, and the second column is for attribute A 2 . Choose any two attribute values 
v\ and v 2 from A2 , where C assigns codes of maximum Hamming distance (say 
d) from one another. If A2 is large enough, d > 2. Our bad input table has 
unique ascending values in the first column, and the second column alternates 
between v\ and V2- Let this continue for w rows. On this input, there will be 
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d bitmaps that are entirely dirty for the second columrj^J Other bitmaps in the 
second column are made entirely of identical clean words. 

Now consider C , some allocation that assigns v\ and vi codewords at Ham- 
ming distance 2. On this input, C produces only 2 dirty words in the bitmaps 
for A 2 . This is fewer dirty words than C produced. 

Because bitmaps containing only identical clean words use less storage than 
bitmaps made entirely of dirty words, we have that allocation C will compress 
the second column better. This concludes the proof. 

5.2.2. Gray-Lex allocation and GC-ordered indexes 

Despite the pessimistic result of Proposition[2j we can focus in choosing good 
allocations for special cases, such as dense indexes (including those where most 
of the possible rows appear) , or for typical sets of data. 

For dense indexes, GC sorting is better 9 at minimizing the number of 
runs, a helpful effect even with word-aligned schemes. However, as we already 
pointed out, the approach used by Pinar et al. [9] requires il(nL) time. For 
technical reasons, even our more economical B-tree approach is much slower 
than lexicographic sorting. As an alternative, we propose a low-cost way to GC 
sort k-of-N indexes, using only lexicographic sorting and Gray-Lex allocation. 

We now examine Gray-Lex allocation more carefully, to prove that its re- 
sults are equivalent to building the uncompressed index, GC sorting, and then 
compressing the index. 

Let 7i be the invertible mapping from attribute i to the ki-of-N code — 
written as an A^-bit vector. Gray-Lex implies a form of monotonicity: for a 
and a' belonging to the i th attribute, Ai, a < a' Ji(a) < gc li{o!)- The overall 
encoding of a table row r = (oi, a-z, • • • , a c ) is obtained by applying each ji to 
Oj, and concatenating the c results. I.e., r is encoded into 

7l( a l) 72(02) 7c(»c) 

r(r) = (ax,a2, . . . a^jONt+i, ■ ■ ■ oln x +n 2 , ■ ■ - oll-n^+i, ■ ■ ■ oll) 

where ai G {0, 1} for all i. 

First, let us assume that we use only k-of-N codes, for k even. Then, the 
following proposition holds. 

Proposition 3. Given two table rows r and r' , using Gray-Lex k-of-N codes 
for k even, we have r <\ cx r' T(r) < gc r(r'). The values of k and N can 

vary from column to column. 

Proof. We write r = (01, . . . , a c ) and r' = (a[, . . . , a' c ). We note (ai, ... ,ai) — 
r(r) and (a[, . . . ,a' L ) = T(r'). Without loss of generality, we assume T(r) < gc 
r(r'). First, if T(r) = T(r'), then r = r' since each 7^ is invertible. 



5 There are other values in A2 and if we must use them, let them occur once each, at the 
end of the table, and make a table whose length is a large multiple of w. 
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We now proceed to the case where T(r) < gc T(r'). Since they are not equal, 
Definition [l] implies there is a bit position t where they first differ; at position 
t, we have that a t — ot\ a-i • • • © att-i- Let t denote the index of the 
attribute associated with bitmap t. In other words, Nx + N 2 + ■ ■ ■ + A^_j < t < 
Ni + N2 + ■ ■ ■ + Nf. Let t' be the first bitmap of the block for attribute t; i.e., 
t' = N 1 + N 2 + --- + Nf_ 1 + l. 



L(r)< gc r(r') 

at = ©■=! on A a t ? a' t A A!=J(«< = <A) Dcf - □ 
<^=> a>t = ®\=i cti 0*I t / a, A a f 7^ 

A Ai=Ti ( a » = a i) A A*=t' ( a i = a i) associativity 

a ( = 0® 0*=t< a i A a t 7^ all codes are k-oi-N 

A A*=i ( a i = A Ki=t'( a i = a i) and fc is even 

A Ai=i( a i = a 't) A Ai=i'( a i = a i) 7» is invertible 

7 f (of) < gc 7f(Of) A Afci(°i = a D Def -[H 

<^=^ a t - <i ox a'- A Ai=i( a i — a i) 7 s are monotone 



t 

T <icx f' Def. of lex. order 



□ 



If some columns have fcj-of-iVj codes with ki odd, then we have to reverse 
the order of the Gray-Lex allocation for some columns. Define the alternating 
Gray-Lex allocation to be such that it has the Gray-Lex monotonicity (a < a' =>• 
7,(a) < gc 7i(a')) when 25=1 1S even, and is reversed (a < a' 7,(a) > gc 
7i(a')) otherwise. Then we have the following lemma. 

Lemma 4. Given a table to be indexed with alternating Gray-Lex k-of-N en- 
coding, the following algorithms have the same output: 

• Construct the bitmap index and sort bit vector rows using GC order. 

• Sort the table lexicographically and then construct the index. 
The values of k and N can vary from column to column. 

This result applies to any encoding where there is a fixed number of 1-bits 
per column. Indeed, in these cases, we are merely using a subset of the k-of-N 
codes. For example, it also works with multi-component encoding where each 
component is indexed using a unary encoding. 

5.2.3. Other Gray codes 

In addition to the usual Gray code, many other binary codes have the prop- 
erty that any codeword is at Hamming distance 1 from its successor. Thus, they 
can be considered "Gray codes" as well, although we shall qualify them to avoid 
confusion from our standard ("reflected") Gray code. 
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(a) 3-bit 



reflected GC 



(b) swap 



columns 1&3 



(c) then invert column 3 



Figure 4: The 3-bit reflected GC and two other Gray codes obtained from it, first by ex- 
changing the outermost columns, then by inverting the bits in the third column. 

Trivially, we could permute columns in the Gray code table, or invert the bit 
values in particular columns (see Fig. [4]). However, there are other codes that 
cannot be trivially derived from the standard Gray code. Knuth [36l § 7.2.1.1] 
presents many results for such codes. 

For us, three properties are important: 

1. Whether successive k-of-N codewords have a Hamming distance of 2. 

2. Whether the final codeword is at Hamming distance 1 from the initial 
codeword. Similarly, whether the initial and final k-of-N codewords are at 
Hamming distance 2. 

3. Whether a collection of more than 2 successive codes (or more than 2 suc- 
cessive k-of-N codes) has a small expected "collective Hamming distance". 
(Count 1 for every bit position where at least two codes disagree. ) 

The first property is important if we are assigning k-of-N codes to attribute 
values. 

The second property distinguishes, in Knuth's terminology, "Gray paths" 
from "Gray cycles." It is important unless an attribute is the primary sort key. 
E.g., the second column of a sorted table will have its values cycle from the 
smallest to the largest, again and again. 

The third property is related to the "long runs" property |36l 137] of some 
Gray codes. Ideally, we would want to have long runs of identical values when 
enumerating all codes. However, for any L-bit Gray cycle, every code word 
terminates precisely one run, hence the number of runs is always 2 L . Therefore, 
the average run length is always L. The distribution of run lengths varies by 
code, however. When L is large, Goddyn and Grozdjak show there are codes 
where no run is shorter than L — 31og 2 L; in particular, for L = 1024, there is a 
code with no run shorter than 1000 [36, 37 . In our context, this property may 
be unhelpful: with k-of-N encodings, we are interested in only those codewords 
of Hamming weight k. Also, rather than have all runs of approximately length 
L, we might prefer a few very long runs (at the cost of many short ones). 

One notable Gray code is constructed by Savage and Winkler [35], henceforth 
Savage- Winkler (see also KnuthJ3S] p. 89]). It has all k-of-N codes appearing 
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Figure 5: Effect of various orderings of the k-of-N codes. Top left: Number of runs of length 
x, 2-of-8 codes. For legibility, we omit counts above 30. Goddyn-Grozdjak GC had 42 runs 
of length 1, binary had 32, and random had 56. Binary, reflected GC and Savage- Winkler 
had a run of length 15, and reflected GC and binary had a run of length 22. Remainder: 
Probability that a randomly-chosen bit from an uncompressed index falls in a run of length 
x or more. Goddyn-Grozdjak and the random ordering were significantly worse and omitted 
for legibility. In |5(c)| when the techniques differed, reflected GC was best, then binary, then 
Savage- Winkler GC. 



nearly together — interleaved with codes of Hamming weight k — 1 or k + 1. 
Consequently, successive k-of-N codes have Hamming distance 2 — just like the 
common/reflected Gray codes. 

The run-length distributions of the various codes are heavily affected when 
we limit ourselves to k-of-N codes. This is illustrated by Fig. [5] where we 
examine the run lengths of the 2-of-8 codewords, as ordered by various Gray 
codes. The code noted as Goddyn-Grozdjak was obtained by inspecting a figure 
of Knuth [351 Fig- 14d]; some discussion in the exercises may indicate the code 
is due to Goddyn and Grozdjak [37]. 



From Fig. 5(a) we see that run-length distributions vary considerably be- 
tween codes. (These numbers are for lists of k-of-N codes without repetition; in 
an actual table, attribute values are repeated and long runs are more frequent.) 
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Figure 6: Probabilities that a bitmap will contain a dirty word, when several consecutive (how 
many: x-axis) of 1000 possible distinct k-of-N codes are found in a 32-row chunk. Effects are 
shown for values with k-of-N codes that are adjacent in reflected GC, Savage-Winkler GC, 
binary or random order. 



Both Goddyn-Grozdjak GC and the random listing stand out as having many 
short runs. However, the important issue is whether the codes support many 
sufficiently long runs to get compression benefits. 

Suppose we list all k-of-N codes. Then, we randomly select a single bit 
position (in one of the N bitmaps). Is there a good chance that this bit position 
lies within a long run of identical bits? For 2-of-8, 3-of-20 and 4-of-14, we 
computed these probabilities (see Fig. |5(b)[ |5(c) and |5(d)| . Random ordering 



and the Goddyn-Grozdjak GC ordering were significantly worse and they have 
been omitted. From these figure, we see that standard reflected Gray-code 
ordering is usually best, but ordinary lexicographic ordering is often able to 
provide long runs. Thus, we might expect that binary allocation will lead to 
few dirty words when we index a table. 

Minimizing the number of dirty words. For a given column, suppose that in a 
block of 32 rows, we have j distinct attribute values. We computed the average 
number of bitmaps whose word would be dirty (see Fig. [6j where we divide by 
the number of bitmaps). Comparing k-of-N codes that were adjacent in GC or- 
dering against k-of-N codes that were lexicographically adjacent, the difference 
was insignificant for k = 2. However, GC ordering is substantially better for 
k > 2, where bitmaps are denser. The difference between codes becomes more 
apparent when many attribute values share the same word. Savage-Winkler 
does poorly, eventually being outperformed even by lexicographic ordering. Se- 
lecting the codes randomly is disastrous. Hence, sorting part of a column — even 
one without long runs of identical values — improves compression for k > 1. 

5.3. Choosing the column order 

Lexicographic table sorting uses the i th column as the i th sort key: it uses 
the first column as the main key, the second column to break ties when two 



20 



rows have the same first component, and so on. Some column orderings may 
lead to smaller indexes than others. 

We model the storage cost of a bitmap index as the sum of the number of 
dirty words and the number of sequences of identical clean words (lxll or 0x00). 
If a set of L bitmaps has x dirty words, then there are at most L + x sequences 
of clean words; the storage cost is at most 2x + L. This bound is tighter for 
sparser bitmaps. Because the simple index of a column has at most n 1-bits, it 
has at most n dirty words, and thus, the storage cost is at most 3n. The next 
proposition shows that the storage cost of a sorted column is bounded by 5n.j. 

Proposition 4. Using GC-sorted consecutive k-of-L codes, a sorted column 
with Hi distinct values has no more than 2m dirty words, and the storage cost 
is no more than Arii + \kny k ~\ . 

PROOF. Using \kn X J kl \ bitmaps is sufficient to represent rij values. Because 
the column is sorted, we know that the Hamming distance of the bitmap rows 
corresponding to two successive and different attribute values is 2. Thus every 
transition creates at most two dirty words. There are rii transitions, and thus 
at most 2ni dirty words. This proves the result. 

For k = 1, Proposition [4] is true irrespective of the order of the values, 
as long as identical values appear sequentially. Another extreme is to assume 
that all 1-bits are randomly distributed. Then sparse bitmap indexes have 
~ 5(r, L,n) = (1 — (1 — -£^) w )^ dirty words where r is the number of 1-bits, L 
is the number of bitmaps and w is the word length (w = 32). Hence, we have an 
approximate storage cost of 28 (r, L, n) + [~fcn* ] . The gain of column C is the 
difference between the expected storage cost of a randomly row-shuffled C, minus 
the storage cost of a sorted C. We estimate the gain by 25(kn, \kn\^ kl \ , n) — An-i 
(see Fig. [7]) for columns with uniform histograms. The gain is modal: it increases 
until a maximum is reached and then it decreases. The maximum gain is reached 
at w (n(w — l)/2) k ^ k+i ': for n — 100, 000 and w — 32, the maximum is reached 
at f» 1 200 for k = 1 and at w 13 400 for k — 2. Skewed histograms have a lesser 
gain for a fixed cardinality rij. 

Lexicographic sort divides the z th column into at most n\ni ■ ■ ■ rij_i sorted 
blocks. Hence, it has at most 2n\ ■ ■ -m dirty words. When the distributions are 
skewed, the i th column will have blocks of different lengths and their ordering 
depends on how the columns are ordered. For example, if the first dimension 
is skewed and the second uniform, the short blocks will be clustered, whereas 
the reverse is true if columns are exchanged. Clustering the short blocks, and 
thus the dirty words, increases compressibility. Thus, it may be preferable to 
put skewed columns in the first positions even though they have lesser sorting 
gain. To assess these effects, we generated data with 4 independent columns: 



using uniformly distributed dimensions of different sizes (see Fig. 8(a) ) and using 
same-size dimensions of different skew (see Fig. |8(b)"j ). We then determined the 
Gray-Lex index size — as measured by the sum of bitmap sizes — for each of the 
4! different dimension orderings. Based on these results, for sparse indexes 
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Figure 7: Storage gain in words for sorting a given column with 100, 000 rows and various 
number of attribute values (2<5(fcn, \kn 1 J k '\,n) — 4n^ ). 
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(a) Uniform histograms with cardinalities 200, (b) Zipfian data with skew parameters 1.6, 
400, 600, 800 1.2, 0.8 and 0.4 



Figure 8: Sum of EWAH bitmap sizes in words for various dimension orders on synthetic data 
(100,000 rows). Zipfian columns have 100 distinct values. Ordering "1234" indicates ordering 
by descending skew (Zipfian) or ascending cardinality (uniform). 



(fc = 1), dimensions should be ordered from least to most skewed, and from 
smallest to largest; whereas the opposite is true for k > 1. 

A sensible heuristic might be to sort columns by increasing density (w 
n i ). However, a very sparse column {n X J k 3> w) will not benefit from sort- 
ing (see Fig. [7]) and should be put last. Hence, we use the following heuris- 
tic: columns are sorted in decreasing order with respect to min(n i , (1 — 
n i 1 ^ k )/(4w — 1)): this function is maximum at density n i 1 ^ k = l/(4w) and it 
goes down to zero as the density goes to 1. In Fig. |8(a)| this heuristic makes 
the best choice for all values of k. We consider this heuristic further in S 17.71 



5.4- Avoiding column order 

Canahuate et al. |10j propose to permute bitmaps individually prior to sort- 
ing, instead of permuting table columns. We compare these two strategies ex- 



perimentally in § 7.9 
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As a practical alternative to lexicographic sort and column (or bitmap) 
reordering, we introduce Frequent-Component (FC) sorting, which uses his- 
tograms to help sort without bias from a fixed dimension ordering. In sorting, 
we compare the frequency of the i th most frequent attribute values in each of 
two rows without regard (except for possible tie-breaking) to which columns 
they come from. For example, consider the following table: 

cat blue 
cat red 
dog green 
cat green 

We have the following (frequency, value) pairs: (l,blue), (l,red), (l,dog), (2, green), 
and (3, cat). For two rows r x and r 2 , <fc first compares (/(ai),<2i) with 
(/(a2),a2), where a\ is the least frequent component in r± and ai is the least 
frequent component in — /(ai) is the frequency of the component a;. Values 
ai and ct2 can be from different columns. Ties are broken by the second-lcast- 
frequent components in t\ and r^, and so forth. Hence, the sorted table in our 
example is 

dog green 
cat blue 
cat red 
cat green. 

With appropriate pre- and post-processing, it is possible to implement FC 
using a standard sorting utility such as Unix sort. First, we sort the compo- 
nents of each row of the table into ascending frequency. In this process, each 
component is replaced by three consecutive components, /(a), a, and pos(a). 
The third component records the column where a was originally found. In our 
example, the table becomes 

(l,dog,l) (2,green,2) 
(l,bluc,2) (3,cat,l) 
(l,red,2) (3,cat,l) 
(2, green, 2) (3,cat,l). 

Lexicographic sorting (via sort or a similar utility) of rows follows, after which 
each row is put back to its original value (by removing f(a) and storing a as 
component pos(a)). 

6. Picking the right fc-of-iV 

Choosing k and N are important decisions. We choose a single k value 
for all dimensions^] leaving the possibility of varying k by dimension as future 



"Except that for columns with small rii, we automatically adjust k downward when it 
exceeds the limits noted at the end of § [2] 
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work. Larger values of k typically lead to a smaller index and a faster con- 
struction time — although we have observed cases where k = 2 makes a larger 
index. However, query times increase with k: there is a construction time/speed 
tradeoff. 



6.1. Larger k makes queries slower 

We can bound the additional cost of queries. Write (~/\ = rij. A given fc-of- 
Li bitmap is the result of an OR operation over at most krii / Li unary bitmaps 
by the following proposition. 

Proposition 5. Ln k-of-N encoding, each attribute value is linked to k bitmaps, 
and each bitmap is linked to at most jjf (^f) attribute values. 

Proof. There are attribute values. Each attribute value is included in 
k bitmaps. The bipartite graph from attribute values to bitmaps has fc(^f) edges. 
There are N bitmaps, hence (^) edges per bitmap. This concludes the proof. 

Moreover, ni = ( ,*) < (e • Li/k) k by a standard inequality, so that Li/k > 
nl /e or fc/Z, < e • n i X ^ k < 3n { 1 . Hence, km/Li < 3n[ k 1 " k , 

Because | Vi-^»l — S;l^l> the expected size of such a k-oi-Li bitmap is 
no larger than 3n[ k 1 times the expected size of a unary bitmap. A query 
looking for one attribute value will have to AND together k of these denser 
bitmaps. The entire ANDing operation can be done (see the end of § [3]) by 
k — 1 pairwise ANDs that produce intermediate results whose EWAH sizes are 
increasingly small: 2k — 1 bitmaps are thus processed. Hence, the expected 
time complexity of an equality query on a dimension of size ni is no more than 

3(2k — 1)71^ k times higher than the expected cost of the same query on a k = 1 
index. (For k large, we may use see Algorithm|3]to substitute log k for the 2k — 1 
factor.) 

For a less pessimistic estimate of this dependence, observe that indexes sel- 
dom increase in size when k grows. We may conservatively assume that index 
size is unchanged when k changes. Therefore, the expected size of one bitmap 
grows as the reciprocal of the number of bitmaps (~ n i 1 ^ k /k), leading to 
queries whose cost is proportional to ~ (2k — l)n j ~ 1/ '' s /A; = (2 — l/A^n" 1 ^. 
Relative to the cost for k — 1 , which is proportional to 1 jni , we can say that 
increasing k leads to queries that are (2 — l/k)n[ k 1 times more expensive 
than on a simple bitmap index. 

For example, suppose rii — 100. Then going from k — 1 to k — 2 should 
increase query cost about 15 fold but no more than 90 fold. In summary, the 
move from k — 1 to anything larger can have a dramatic negative effect on query 
speeds. Once we are at k = 2, the incremental cost of going to k = 3 or k = 4 
is low: whereas the ratio k — 2 : k — 1 goes as y/nl, the ratio k = 3 : k = 2 goes 
as n]^ 6 . We investigate this issue experimentally in § 7.10 
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6.2. When does a larger k make the index smaller? 

Consider the effect of a length 100 run of values v\, followed by 100 rep- 
etitions of V2, then 100 of V3, etc. Regardless of k, whenever we switch from 
v\ to Vi+\ at least two bitmaps will have to make transitions between and 
1. Thus, unless the transition appears at a word boundary, we create at least 
2 dirty words whenever an attribute changes from row to row. The best case, 
where only 2 dirty words are created, is achieved when k — 1 for any assignment 
of bitmap codes to attribute values. For k > 1 and N as small as possible, it 
may be impossible to achieve so few dirty words, or it may require a particular 
assignment of bitmap codes to values. 

Encodings with k > 1 find their use when many (e.g. 15) attribute values 
fall within a word- length boundary. In that case, a k = 1 index will have at least 
15 bitmaps with transitions (and we can anticipate 15 dirty words). However, 
if there were only 45 possible values in the dimension, 10 bitmaps would suffice 
with k = 2. Hence, there would be at most 10 dirty words and maybe less if we 
have sorted the data (see Fig. [6]). 

6.3. Choosing N 

It seems intuitive, having chosen k, to choose N to be as small as possible. 
Yet, we have observed cases where the resulting 2-of-iV indexes are much bigger 
than 1-of-iV indexes. Theoretically, this could be avoided if we allowed larger 
N, because one could aways append an additional 1 to every attribute's 1-of-iV 
code. Since this would create one more (clean) bitmap than the 1-of-iV index 
has, this 2-of-iV index would never be much larger than the 1-of-iV index. So, if 
N is unconstrained, we can see that there is never a significant space advantage 
to choosing k small. 

Nevertheless, the main advantage of k > 1 is fewer bitmaps. We choose ./V 
as small as possible. 

7. Experimental results 

We present experiments to assess the effects of various factors (choices of k, 
sorting approaches, dimension orderings) in terms of EWAH index sizes. These 
factors also affect index creation and query times. We report real wall-clock 
times. 

7.1. Platform 

Our test programs^] were written in C++ and compiled by GNU GCC 4.0.2 
on an Apple Mac Pro with two double-core Intel Xeon processors (2.66 GHz) 
and 2 GiB of RAM. Experiments used a 500 GB SATA Hitachi disk (model 
HDP725050GLA360 [Ml IS]), with average seek time (to read) of 14 ms , average 
rotational latency of 4.2 ms, and capability for sustained transfers at 300MB/s. 



: //code . google . com/p/lemurbitmapindex/ 
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Table 3: Characteristics of data sets used. 





rows 


cols 




size 


Census-Income 


199 523 


42 


103 419 


99.1MB 


4-d projection 


199 523 


4 


102 609 


2.96 MB 


DBGEN 


13 977 980 


1G 


4411936 


1.5GB 


4-d projection 


13 977 980 


4 


402 544 


297 MB 


Netflix 


100 480 507 


4 


500 146 


2.61GB 


KJV-4grams 


877 020 839 


4 


33 553 


21.6 GB 



This disk also has an on-board cache size of 16MB, and is formatted for the 
Mac OS Extended filesystem (journaled). Unless otherwise stated, we use 32-bit 
binaries. Lexicographic sorts of flat files were done using GNU coreutils sort 
version 6.9. For constructing all indexes, we used Algorithm [I] because without 
it, the index creation times were 20-100 times larger, depending on the data 
set. 

7.2. Data sets used 

We primarily used four data sets, whose details are summarized in Table [3] 
Census-Income 41J, DBGEN 02], KJV-4grams, and Netflix 03]. DBGEN is a 
synthetic data set, whereas KJV-4grams is a large list (including duplicates) of 4- 
tuples of words obtained from the verses in the King James Bible [44 , after stem- 
ming with the Porter algorithm [45 and removal of stemmed words with three or 
fewer letters. Occurrence of row wi, u>2, u>3, if 4 indicates that the first paragraph 
of a verse contains words wi through w 4 , in this order. KJV-4grams is motivated 
by research on Data Warehousing applied to text analysis 0B]. Each of column 
of KJV-4grams contains roughly 8 thousand distinct stemmed words. The Net- 
flix table has 4 dimensions: UserlD, MovielD, Date and Rating, having cardi- 
nalities 480 189, 17770, 2 182, and 5. Since the data was originally supplied in 
17 700 small files (one file per film), we concatenated them into a flat file with an 
additional column for the film and randomized the order of its rows using Unix 
commands such as cat -n file.csv I sort — random-sort I cut -f 2-. 
All files were initially randomly shuffled. 

For some of our tests, we chose four dimensions with a wide range of sizes. 
For Census-Income, we chose age (cZi), wage per hour (cfe), dividends from stocks 
(d 3 ) and a numerical valu^Jfound in the 25 th position (g? 4 ). Their respective car- 
dinalities were 91, 1240, 1478 and 99 800. For DBGEN, we selected dimensions 
of cardinality 7, 11, 2 526 and 400 000. Dimensions are numbered by increasing 
size: column 1 has fewest distinct values. 

7. 3. Overview of experiments 

Using our test environment, our experiments assessed 



The associated metadata says this column should be a 10-valued migration code. 
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whether a partial (block-wise) sort could save enough time to justify lower 
quality indexes (§ |7.4|; 



the effect that sorting has on index construction time (§ 7.5 1 
the merits of various code assignments (§ |7.6|; 



whether column ordering (as discussed in § 5.3 1 has a significant effect on 
index size (§ 7.7); 



• whether the index size grows linearly as the data set grows (§ 7.8); 



• whether bitmap reordering is preferable to our column reordering (§ 7.9 1; 

• whether larger k actually gives a dramatic slowdown in query speeds, 
which § 6.1 predicted was possible (§ 7.101; 

• whether word length has a significant effect on the performance of EWAH 

• whether 64-bit indexes are faster than 32-bit index when aggregating many 
bitmaps (§ |7.12 l. 

In all of our experiment involving 32-bit words (our usual case), we choose 
to implement EWAH with 16-bit counters to compress clean words. When 
there are runs with many more than 2 16 clean words, 32-bit EWAH might be 
inefficient. However, on our data sets, no more than 14% of all counters had the 
maximal value on sorted indexes, and no more than 3% on unsorted indexes (see 
Table [4]). Hence, EWAH is less efficient than WAH by a factor of no more than 
14% at storing the clean words. However, EWAH is more efficient than WAH by 
a constant factor of 3% at storing the dirty words. The last column in Table [4] 
shows runs of clean words make up only about half the storage; the rest is made 
of dirty words. For 64-bit indexes, we have not seen any overrun. 



7.4- Sorting disjoint blocks 

Instead of sorting the entire table, we may partition the table horizontally 
into disjoint blocks. Each block can then be sorted lexicographically and the ta- 
ble reconstructed. Given B blocks, the sorting complexity goes from O(nlogn) 
to 0(n\ogn/ B). Furthermore, if blocks are small enough, we can sort in main 
memory Unfortunately, the indexing time and bitmap sizes both substantially 
increase, even with only 5 blocks. (See Table [5]) Altogether, sorting by blocks 
does not seem useful. 

Hence, competitive row-reordering alternatives should be scalable to a large 
number of rows. For example, any heuristic in Q,{n 2 ) is probably irrelevant. 
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Table 4: Percentage of overruns in clean word compression using 32-bit EWAH with unary 
bitmaps and lexicographically sorted tables 



(a) lexicographically sorted 





overruns 


clean runs 
total size 


Census-Income (4-d) 


0% 


60% 


DBGEN (4-d) 


13% 


44% 


Netflix 


14% 


49% 


KJV-4grams 


4.3% 


43% 


(b) unsorted 




overruns 




Census-Income (4-d) 


0% 


52% 


DBGEN (4-d) 


0.2% 


45% 


Netflix 


2.4% 


49% 


KJV-4grams 


0.1% 


47% 



Table 5: Time required to sort and index, and sum of the compressed sizes of the bitmaps, 
for k = 1 (time in seconds and size in MB). Only three columns of each data sets are used 
with cardinalities of 7, 11, 400 000 for DBGEN and of 5, 2 182 and 17 770 for Netflix. 





DBGEN (3d) 


# of blocks 


sort 


fusion indexing 


total 


size 


1 (complete sort) 


31 




65 


96 


39 


5 


28 


2 


68 


98 


51 


10 


24 


3 


70 


99 


58 


500 


17 


3 


87 


107 


116 


no sorting 






100 


100 


119 




Netflix (3d) 


1 (complete sort) 


487 




558 


1045 


129 


5 


360 


85 


572 


1017 


264 


10 


326 


87 


575 


986 


318 


500 


230 


86 


601 


917 


806 


no sorting 






689 


689 


1552 



7.5. Index construction time 

Table [5] shows that sorting may increase the overall index-construction time 
(by 35% for Netflix). While Netflix and DBGEN nearly fit in the machine's 
main memory (2GiB), KJV-4grams is much larger (21.6GB). Constructing a 
simple bitmap index (using Gray-Lex) over KJV-4grams took approximately 
14 000 s or less than four hours. Nearly half (6 000 s) of the time was due to the 
sort utility since the data set exceeds the machine's main memory (21.6 GB 
vs. 2GiB). Constructing an unsorted index is faster (approximately 10 000 s or 
30% less) , but the index is about 9 times larger (see Table [6]) . 

For DBGEN, Netflix and KJV-4grams, the construction of the bitmap index 
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Figure 9: Relative performance, as a function of the number of dimensions, on a Zipfian data 
set. 



itself over the sorted table is faster by at least 20%. This effect is so significant 
over DBGEN that it is faster to first sort prior to indexing. 

7. 6. Sorting 

On some synthetic Zipfian data sets, we found a small improvement (less 
than 4% for 2 dimensions) by using Gray-Lex in preference to Binary-Lex. Our 
data sets have 100 attribute values per dimension, and the frequency of the 
attribute values is Zipfian (proportional to 1/r, where r is the rank of an item). 
Dimensions were independent of one another. See Fig. [9] where we compare 
Binary-Lex to an unsorted table, and then Gray-Lex to Binary-Lex. For the 
latter, the advantage drops quickly with the number of dimensions. For one 
dimension, the performance improvement is 9% for k = 2, but for more than 

2 dimensions, it is less than 2%. On other data sets, Gray-Lex either had no 
effect or a small positive effect. 

Table [6] shows the sum of bitmap sizes using Gray-Lex orderings and Gray- 
Frequency. For comparison, we also used an unsorted table (the code allocation 
should not matter; we used the same code allocation as Binary-Lex), and we used 
a random code assignment with a lexicographically sorted table (Rand-Lex). 
Dimensions were ordered from the largest to the smallest ("4321") except for 
Census-Income where we used the ordering "3214". 

KJV-4grams had a larger index for k — 2 than k = 1. This data set has 
many very long runs of identical attribute values in the first two dimensions, 
and the number of attribute values is modest, compared with the number of 
rows. This is ideal for 1-of-iV. 

For k = 1, as expected, encoding is irrelevant: Rand-Lex, Binary-Lex, Gray- 
Lex, and Gray-Freq have identical results. However, sorting the table lexico- 
graphically is important: the reduction in size of the bitmaps is about 40% for 

3 data sets (Census-Income, DBGEN, Nctflix), and goes up to 90% for KJV- 
4grams. 

For k > 1, Gray-Frequency yields the smallest indexes in Table [6} The differ- 
ence with the second-best, Gray-Lex, can be substantial (25%) but is typically 
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Table 6: Total sizes (words) of 32-bit EWAH bitmaps for various sorting methods. 



k 


Unsorted 


Rand-Lex 


Binary-Lex 


Gray-Lex 


Gray-Freq. 


Census- 


1 


8.49 x 10 5 


4.87 x 10 b 


4.87 x 10 b 


4.87 x 10 b 


4.87 x 10 b 


Income 


2 


9.12 x 10 5 


6.53 x 10 5 


4.53 x 10 5 


4.52 x 10 5 


4.36 x 10 5 


(4d) 


3 


6.90 x 10 5 


4.85 x 10 5 


3.77 x 10 5 


3.73 x 10 5 


3.28 x 10 5 




4 


4.58 x 10 5 


2.74 x 10 5 


2.23 x 10 5 


2.17 x 10 s 


1.98 x 10 s 


DBGEN 


1 


5.48 x 10 Y 


3.38 x 10 Y 


3.38 x 10 Y 


3.38 x 10' 


3.38 x 10' 


(4d) 


2 


7.13 x 10 7 


2.90 x 10 7 


2.76 x 10 7 


2.76 x 10 7 


2.74 x 10 7 




3 


5.25 x 10 7 


1.73 x 10 7 


1.51 x 10 7 


1.50 x 10 7 


1.50 x 10 7 




4 


3.24 x 10 7 


1.52 x 10 7 


1.21 x 10 7 


1.21 x 10 7 


1.19 x 10 7 


Netflix 


1 


6.20 x 10 s 


3.22 x 10 8 


3.22 x 10 s 


3.22 x 10 s 


3.19 x 10 8 




2 


8.27 x 10 8 


4.18 x 10 8 


3.17 x 10 s 


3.17 x 10 s 


2.43 x 10 s 




3 


5.73 x 10 8 


2.40 x 10 8 


1.98 x 10 s 


1.97 x 10 s 


1.49 x 10 s 




4 


3.42 x 10 8 


1.60 x 10 8 


1.39 x 10 s 


1.37 x 10 s 


1.14 x 10 8 


KJV- 


1 


6.08 x 10 y 


6.68 x 10 8 


6.68 x 10 s 


6.68 x 10 8 


6.68 x 10 8 


4grams 


2 


8.02 x 10 9 


1.09 x 10 9 


1.01 x 10 9 


9.93 x 10 s 


7.29 x 10 8 




3 


4.13 x 10 9 


9.20 x 10 8 


8.34 x 10 s 


8.31 x 10 s 


5.77 x 10 s 




4 


2.52 x 10 9 


7.23 x 10 s 


6.49 x 10 s 


6.39 x 10 s 


5.01 x 10 s 



small. However, Gray-Frequency is histogram-aware and thus, more complex 
to implement. The difference between Gray-Lex and Binary-Lex is small even 
though Gray-Lex is sometimes slightly better («2%) especially for denser in- 
dexes (k — 4). However, Rand-Lex is noticeably worse (up to «25%) than 
both of them: this means that encoding is a significant issue. All three schemes 
(Binary-Lex, Gray-Lex, Rand-Lex) have about the same complexity — all three 
are histogram-oblivious — and therefore Gray-Lex is recommended. 

We omit Frequent-Component from the table. On Netflix, for fc = 1, it 
outperformed the other approaches by 1%, and for DBGEN it was only slightly 
worse than the others. But in all other case on DBGEN, Census-Income and 
Netflix, it lead to indexes 5-50% larger. (For instance, on Netflix (k = 4) the 
index size was 1.52 x 10 s words, barely better than Rand-Lex and substantially 
worse than Gray- Frequency.) Because it interleaves attribute values and it is 
histogram-aware, it may be the most difficult scheme to implement efficiently 
among our candidates. Hence, we recommend against Frequent-Component. 

7. 7. Column effects 

We experimentally evaluated how lexicographic sorting affects the EWAH 
compression of individual columns. Whereas sorting tends to create runs of 
identical values in the first columns, the benefits of sorting are far less apparent 
in later columns, except those strongly correlated with the first few columns. 
For Table (7) we have sorted projections of Census-Income and DBGEN onto 
10 dimensions d\ . . . d w with ni < . . . < n 10 . (The dimensions di ■ ■ ■ d± in this 
group are different from the dimensions d\. . .d± discussed earlier.) We see that 
if we sort from the largest column (dio . . . d\), at most 3 columns benefit from 
the sort, whereas 5 or more columns benefit when sorting from the smallest 
column (di . . . dio). 
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Table 7: Number of 32-bit words used for different unary indexes when the table was sorted 
lexicographically (dimensions ordered by descending cardinality, dio . . . d\ , or by ascending 
cardinality, d\ . . . dio). 



(a) Census-Income 





cardinality 


unsorted 


di . . . d w 


dio • • • di 


di 


7 


42 427 


32 


42 309 


d 2 


8 


36 980 


200 


36 521 


d 3 


10 


34 257 


1215 


28 975 




47 


0.13xl0 6 


12118 


0.13xl0 6 


d 5 


51 


35 203 


17 789 


28 803 


d e 


91 


0.27xl0 6 


75 065 


0.25xl0 6 


d 7 


113 


12199 


9217 


12178 


dg 


132 


20 028 


14062 


19917 


d 9 


1240 


29 223 


24 313 


28 673 


dm 


99 800 


0.50xl0 6 


0.48xl0 6 


0.30xl0 6 


total 




l.llxlO 6 


0.64xlO B 


0.87xlO t> 



(b) DBGEN 





cardinality 


unsorted 


d\ . . . d w 


dio . . . di 


di 


2 


0.75xlO a 


24 


0.75xlO b 


d 2 


3 


l.llxlO 6 


38 


l.llxlO 6 


d 3 


7 


2.58xl0 6 


150 


2.78xl0 6 




9 


0.37xl0 6 


1006 


3.37xl0 6 


d 5 


11 


4.11xl0 6 


10 824 


4.11xl0 6 


d 6 


50 


13.60xl0 6 


0.44xl0 6 


1.42xl0 6 


d 7 


2 526 


23.69xl0 6 


22.41 xlO 6 


23.69xl0 6 


d s 


20000 


24.00xl0 6 


24.00 xlO 6 


22.12xl0 6 


dg 


400 000 


24.84xl0 6 


24.84xl0 6 


19.14xl0 6 


dio 


984 297 


27.36xl0 6 


27.31 xlO 6 


0.88xl0 6 


total 




0.122xlO y 


0.099 xlO 9 


0.079 x 10 y 



We also assessed how the total size of the index was affected by various 
column ordcrings; we show the Gray-Lex index sizes for each column ordering 
in Fig. [lO] The dimensions of KJV-4grams are too similar for ordering to be 
interesting, and we have omitted them. For small dimensions, the value of 
k was lowered using the heuristic presented in § [2j Our results suggest that 
table-column reordering has a significant effect (40%). 

The value of k affects which ordering leads to the smallest index: good 
orderings for k = 1 are frequently bad orderings for k > 1, and vice versa. 
This is consistent with our earlier analysis (see Figs. [7] and [8]). For Netflix and 
DBGEN, we have omitted k = 2 for legibility 

Census-Income's largest dimension is very large (114 rj n/2); DBGEN also 
has a large dimension (714 ss n/35). Sorting columns in decreasing order with 
respect to min(n i 1 ^ fc , (1 — n i 1 ^ k )/(4w — 1)) for k = 1, we have that only for 
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550000 
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(a) Census-Income 





(b) DBGEN 



(c) Netfiix 



Figure 10: Sum of EWAH bitmap sizes (words, y axis) on 4-d data sets for all dimension 
orderings (x axis). 



DBGEN the ordering "2134" is suggested, otherwise, "1234" (from smallest to 
largest) is recommended. Thus the heuristic provides nearly optimal recom- 
mendations. For k — 3 and k — 4, the ordering "1234" is recommended for all 
data sets: for k = 4 and Census-Income, this recommendation is wrong. For 
k = 2 and Census-Income, the ordering "3214" is recommended, another wrong 
recommendation for this data set. Hence, a better column reordering heuristic 
is needed for k > 1. Our greedy approach may be too simple, and it may be 
necessary to know the histogram skews. 



7.8. Index size growth 

To study scaling, we built indexes from prefixes of the full KJV-4grams data 
set. We found that the sum of the EWAH bitmap sizes (see Fig. 11 1 increased 
linearly. Yet with sorting, the bitmap sizes increased sublinearly. As new data 
arrives, it is increasingly likely to fit into existing runs, once sorted. Hence — 
everything else being equal — sorting becomes more beneficial as the data sets 
grow. 
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Figure 11: Sum of the EWAH bitmap sizes for various prefixes of the KJV-4grams table 
(fe = l) 



7.9. Bitmap reordering 

Sharma and Goyal [7j consider encoding a table into a bitmap index using 
a multi-component code (similar to k-of-N), then GC sorting the rows of the 
index, and finally applying WAH compression. Canahuate et al. |10j propose a 
similar approach, with the additional step of permuting the columns — meaning 
the individual bitmaps — in the index prior to GC sorting. For example, whereas 
the list of 2-of-4 codes in increasing GC order is 0011, 0110, 0101, 1100, 1010, 
1001, by permuting the first and the last bit, we obtain the following (non- 
standard) Gray code: 1010, 0110, 1100, 0101, 0011, 1001. In effect, reordering 
bitmaps is equivalent to sorting the (unpermitted) index rows according to a 
non-standard Gray code. They chose to use bitmap density to determine which 
index columns should come first, but reported that the different orders had little 
effect on the final index sizes. 

In contrast, our approach has been to permute the columns of the table — not 
the individual bitmaps, then sort the table lexicographically, and finally gener- 
ate the compressed index. Permuting the attributes corresponds to permuting 
blocks of bitmaps: our bitmap permutations are a special case of Canahaute's. 
We do not know a sufficiently efficient method to sort our largest data sets with 
arbitrary bitmap reordering. We cannot construct the uncompressed index: for 
KJV-4grams, we would require at least 3.7TB. Instead, we used the compressed 
B-tree approach mentioned in § |5.1| and applied the bitmap permutation to its 
keys. This was about 100 times slower than our normal Gray-Lex method, and 
implementation restrictions prevented our processing the full Netflix or KJV- 
4grams data sets. Hence, we took the first 20 million records from each of these 
two data sets, forming Netflix20M and KJV20M. 

Our experiments showed that little compression was lost by restricting our- 
selves to the special case of permuting table columns, rather than individual 
bitmaps. While we indexed all the 4! = 24 tables generated by all column 
permutations in our 4-column data sets, it is infeasible to consider all bitmap 
permutations. Even if there were only 100 bitmaps, the number of permuta- 
tions would be prohibitively large (100! « 10 158 ). We considered three heuristics 
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Table 8: Sum of the EWAH bitmap sizes (in words), GC sorting and various bitmap orders 





Best column 
order 


Per-bitmap reordering 

tt~> tv r n ~n nn 

IF MSF SF 


Census-Income k — 1 
(4d) 2 
3 

A 

£± 


4.87 x 10 b 
3.74 x 10 5 
2.99 x 10 5 

1 Qd v 1 fl S 
A 1U 


4.91 x 10 5 4.91 x 10 b 6.18 x 10° 
4.69 x 10 5 4.10 x 10 s 3.97 x 10 5 
3.83 x 10 5 3.00 x 10 s 3.77 x 10 5 
^ 09 v in 5 i qi v i n 5 i qi v i n 5 


DBGEN 1 
(4d) 2 
3 

4 


2.51 x 10 7 
2.76 x 10 7 
1.50 x 10 7 
1.21 x 10 7 


2.51 x 10 7 2.51 x 10 7 3.39 x 10 7 
4.50 x 10 7 4.35 x 10 7 2.76 x 10 7 
3.80 x 10 7 1.50 x 10 7 1.50 x 10 7 
2.18 x 10 7 1.21 x 10 7 1.21 x 10 7 


Netflix20M 1 
2 
3 
4 


5.48 x 10 7 
7.62 x 10 7 
4.43 x 10 7 
2.99 x 10 7 


5.87 x 10 7 5.87 x 10 7 6.63 x 10 7 
9.05 x 10 7 8.61 x 10 7 7.64 x 10 7 
7.99 x 10 7 4.39 x 10 7 4.39 x 10 7 
4.82 x 10 7 3.00 x 10 7 3.00 x 10 7 


KJV20M 1 
2 
3 
4 


4.06 x 10 7 
5.77 x 10 7 
3.95 x 10 7 
2.72 x 10 7 


4.85 x 10 7 4.83 x 10 7 3.85 x 10 7 

6.46 x 10 7 5.73 x 10 7 5.73 x 10 7 

4.47 x 10 7 4.24 x 10 7 4.24 x 10 7 
3.42 x 10 7 3.38 x 10 7 3.38 x 10 7 



based on bitmap density T> — the number of 1-bits over the total number of bits 
(«): 

1. "Incompressible first" (IF), which orders bitmaps by increasing | T> — 0.5 |. 
In other words, bitmaps with density near 0.5 are first [10 . 

2. "Moderately sparse first" (MSF), ordering by the value min(Z), 4 ^~ 2 "^ 1 ) as 
discussed at the end of § |5.3| This is a per-bitmap variant of the column- 
reordering heuristic we evaluate experimentally in § |7.7| 

3. "Sparse first" (SF): order by increasing V. 

Results are shown in Table [8] In only one case (KJV20M, k — 1), was a 
per-bitmap result significantly better (by 5%) than our default method of rear- 
ranging table columns instead of individual bitmaps. In most other cases, all 
per-bitmap reorderings were worse, sometimes by large factors (30%). 

IF ordering performs poorly when there are some dense bitmaps (i.e., when 
k > 1.) Likewise, SF performs poorly for sparse bitmaps (k = 1). We do not 
confirm prior reports |10j that index column order has relatively little effect 
on the index size: on our data, it makes a substantial difference. Perhaps the 
characteristics of their scientific data sets account for this difference. 

7.10. Queries 

We implemented queries over the bitmap indexes by processing the logical 
operations two bitmaps at a time: we did not use Algorithm [3] Bitmaps are 
processed in sequential order, without sorting by size, for example. The query 
processing costs includes the extraction of the row IDs — the location of the 
1-bits — from the bitmap form of the result. 
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We timed equality queries against our 4-d bitmap indexes. Recall that di- 
mensions were ordered from the largest to the smallest (4321) except for Census- 
Income where we used the ordering "3214." Gray-Lex encoding is used for k > 1. 
Queries were generated by choosing attribute values uniformly at random and 
the figures report average wall-clock times for such queries. We made 100 ran- 
dom choices per column for K JV-4grams when k > 1 . For DBGEN and Netflix, 
we had 1 000 random choices per column and 10 000 random choices were used 
for Census-Income and KJV-4grams (k — 1). For each data set, we give the 
results per column (leftmost tick is the column used as the primary sort key, 



next tick is for the secondary sort key, etc.). The results are shown in Fig. 12 

From Fig. |12(b)| we see that simple bitmap indexes almost always yield the 
fastest queries. The difference caused by k is highly dependent upon the data 
set and the particular column in the data set. However, for a given data set 
and column, with only a few small exceptions, query times increase with k, 
especially from k = 1 to k = 2. For DBGEN, the last two dimensions have size 
7 and 11, whereas for Netflix, the last dimension has size 5, and therefore, they 
will never use a fc-value larger than 2: their speed is mostly oblivious to k. 

An exception occurs for the first dimension of Netflix, and it illustrates the 
importance of benchmarking with large data sets. Note that using k = 1 is much 
slower than using k > 1. However, these tests were done using a disk whose 
access typicalljj^] requires at least 18 ms. In other words, any query answered 
substantially faster than 20 ms was answered without retrieving data from the 
disk platter (presumably, it came from the operating system's cache, or perhaps 
the disk's cache). For k > 1, it appears that the portion of the index for the 
first attribute (which is ss 7 MBp"| could be cached successfully, whereas for 
k = 1, the portion of the index was 100 MEp"| and could not be cached. 

In § [6] we predicted that the query time would grow with k as ~ (2 — 
l/k)n i . for the large dimensions such as the largest ones for DBGEN (400k) 
and Netflix (480k), query times are indeed significantly slower for k — 2 as 
opposed to k = 1. However, our model exaggerates the differences by an order 
of magnitude. The most plausible explanation is that query times are not 
proportional to the sizes of the bitmap loaded, but also include a constant 
factor. This may correspond to disk access times. 

Fig. |12(a) and |12(b)| also show the equality query times per column before 



and after sorting the tables. Sorting improves query times most for larger values 
of k: for Netflix, sorting improved the query times by 



• at most 2 for k = 1, 

• at most 50 for k = 2, 



9 This is perhaps pessimistic, as an operating system may be able to cluster portions of 
the index for a given dimension onto a small number of adjacent tracks, thereby reducing seek 
times. 

10 For k = 2 we have 981 bitmaps and (see Figure 13k about 7 kB per bitmap. 
11 The half-million bitmaps had an average size or about 200 bytes. 
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Figure 12: Query times are affected by dimension, table sorting and k. 



• and at most 120 for k = 3. 

This is consistent with our earlier observation that indexes with k > 1 benefit 
from sorting even when there are no long runs of identical values (see 



5.11 



(On the first columns, k — 3 usually gets the best improvements from sorting.) 
The synthetic data set DBGEN showed no significant speedup from sorting, 
beyond its large first column. Although Netflix, like DBGEN, has a many- 
valued column first, it shows a benefit from sorting even in its third column: in 
fact, the third column benefits more from sorting than the second column. The 
largest table, KJV-4grams, benefited most from the sort: while queries on the 
last column are up to 10 times faster, the gain on the first two columns ranges 
from 125 times faster (k = 1) to almost 3 300 times faster (k = 3). 

We can compare these times with the expected amount of data scanned 



per query. This is shown in Fig. 13 and we observe some agreement between 
most query times and the expected sizes of the bitmaps being scanned. The 
most notable exceptions are for k = 1; in many such cases we must make an 
expensive seek far into a file for a very small compressed bitmap. Moreover, 
a small compressed bitmap may, via long runs of lxll clean words, represent 
many row IDs. To answer the query, we must still produce the set of row IDs. 

7.11. Effect of the word length 

Our experiments so far use 32-bit EWAH. To investigate the effect of word 
length, we recompiled our executables as 64-bit binaries and implemented 16- 
bit and 64-bit EWAH. The index sizes are reported in Table [9] — the index size 
excludes a B-Tree storing maps from attribute values to bitmaps. We make the 
following observations: 

• 16-bit indexes can be 10 times larger than 32-bit indexes. 

• 64-bit indexes are nearly twice as large as 32-bit indexes. 

• Sorting benefits 32-bit and 64-bit indexes equally; 16-bit indexes do not 
benefit from sorting. 
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Figure 13: Bitmap data examined per equality query. 



Table 9: Index size (file size in MB) for unary bitmap indexes (k = 1) under various word 
lengths. For Census-Income and DBGEN, the 4-d projection is used. 



(a) Unsorted 





index size (MB) 


word lenj 


5th 


Census-Income DBGEN Netflix 


KJV-4grams 


16 




12.0 2.5xl0 3 2.6xl0 4 


2.6xl0 4 


32 




3.8 221 2.5xl0 3 


2.4xl0 4 


64 




6.5 416 4.8xl0 3 


4.4xl0 4 






(b) Lexicographically sorted 








index size (MB) 


word leng 


th 


Census-Income DBGEN Netflix 


KJV-4grams 


16 




11.1 2.4xl0 3 2.5xl0 4 


1.6xl0 4 


32 




2.9 137 1.3xl0 3 


2.6xl0 3 


64 




4.8 227 2.2xl0 3 


4.3 xlO 3 



Despite the large variations in file sizes, the difference between index construc- 
tion times (omitted) in 32-bit and 64-bit indexes is within 5%. Hence, index 
construction is not bound by disk I/O performance. 

7.12. Range queries 

Unary bitmap indexes may not be ideally suited for all ranges queries |22j . 
However, range queries are good stress tests: they require loading and comput- 
ing numerous bitmaps. Our goal is to survey the effect of sorting and word 
length on the aggregation of many bitmaps. 

We implemented range queries using the following simple algorithm: 

1. For each dimension, we compute the logical OR of all matching bitmaps. 
We aggregate the bitmaps two at time: ((B\ V B%) V B3) V B4) . . . When 
there are many bitmaps, Algorithm [3] or an in-place algorithm might be 
faster. (See Wu et al. [3j |27] for a detailed comparison of pair-at-a-time 
versus in-place processing.) 
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2. We compute the logical AND of all the dimensional bitmaps — resulting 
from the previous step. 

We implemented a flag to disable the aggregation of the bitmaps to measure 
solely the cost of loading the bitmaps in memory. (Our implementation does 
not write its temporary results to disk.) We omitted 16-bit EWAH from our 
tests due to its poor compression rate. 

As a basis for comparison, we also implemented range queries using uncom- 
pressed external-memory B-tree [35 indexes over each column: the index maps 
values to corresponding row IDs. The computation is implemented as with the 
bitmaps, using the STL functions set -intersection and set .union. We required 
row IDs to be provided in sorted order. All query times were at least an order of 
magnitude larger than with 32-bit or 64-bit bitmap indexes. We failed to index 
the columns with uncompressed B-trees in a reasonable time (a week) over the 
KJV-4grams data set due to the large file size (21.6 GB). 

We generated a set of uniformly randomly distributed 4-d range queries using 
no more than 100 bitmaps per dimension. We used the same set of queries for 
all indexes. The results are presented in Table [l0| Our implementation of range 
queries using uncompressed B-tree indexes is an order of magnitude slower than 
the bitmap indexes over Netflix, hence we omit the results. 

The disk I/O can be nearly twice as slow with 64-bit indexes and KJV- 
4grams. However, disk I/O is negligible, accounting for about 1% of the total 
time. 

The 64-bit indexes are nearly twice as large. We expect that 64-bit indexes 
also generate larger intermediate bitmaps during the computation. Yet, the 64- 
bit indexes have faster overall performance: 40% for DBGEN and 5% for other 
cases, except for sorted KJV-4grams where the gain was 18%. Moreover, the 
benefits of 64-bit indexes are present in both sorted and unsorted indexes. 

8. Guidelines for k 

Our experiments indicate that simple (k = 1) bitmap encoding is preferable 
when storage space and index-creation time are less important than fast equality 
queries. The storage and index-creation penalties are kept modest by table 
sorting and Algorithm [T] 

Space requirements can be reduced by choosing k > 1 , although Table [6] 
shows that this approach has risks (see KJV-4grams). For k > 1, we can gain 
additional index size reduction at the cost of longer index construction by using 
Gray-Frequency rather than Gray-Lex. 

If the total number of attribute values is small relative to the number of 
rows, then we should first try the k = 1 index. Perhaps the data set resembles 
KJV-4grams. Besides yielding faster queries, the k = 1 index may be smaller. 

9. Conclusion and future work 

We showed that while sorting improves bitmap indexes, we can improve them 
even more (30-40%) if we know the number of distinct values in each column. 
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Table 10: Average 4-d range query processing time over the Netflix data set for unary bitmap 
indexes (k = 1) under various word lengths and dimensional B-tree indexes. 



(a) Average wall-clock query time (s) 



DBGEN 


unsorted 


lexicographically sorted 


32-bit EWAH 


0.382 


0.378 


64-bit EWAH 


0.273 


0.265 


Netflix 




32-bit EWAH 


2.87 


1.50 


64-bit EWAH 


2.67 


1.42 


KJV-4grams 




32-bit EWAH 


44.8 


5.2 


64-bit EWAH 


42.4 


4.4 




(b) Average disk I/O time (s) 


DBGEN 


unsorted 


lexicographically sorted 


32-bit EWAH 


0.023 


0.023 


64-bit EWAH 


0.027 


0.026 


Netflix 




32-bit EWAH 


0.11 


0.078 


64-bit EWAH 


0.16 


0.097 


KJV-4grams 




32-bit EWAH 


0.57 


0.06 


64-bit EWAH 


1.11 


0.1 



For k-oi-N encodings with k > 1, even further gains (10-30%) are possible using 
the frequency of each value. Regarding future work, the accurate mathematical 
modelling of compressed bitmap indexes remains an open problem. While we 
only investigated bitmap indexes, we can generalize this work in the context of 
column-oriented databases [U1I3H] by allowing various types of indexes. 
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